SYMBOLS

$a \mid b$            a divides b

$a \nmid b$           a does not divide b

g                     an element of G

gcd                   greatest common divisor

G                     a group

i                     a subscript to $X_i$, or just a general subscript

j                     a subscript to $M_j$, or just a general subscript

k                     a positive integer, or a generating element of the
                      cyclic group

L                     a positive number, or the number of levels of k-apart
                      interconnection networks

m                     a positive integer, or the number of routings required
                      in a k-apart interconnection network

mod N                 modulo N

$M_j$                 the jth memory module of an N-module array memory

$M_N$                 a multiplicative cyclic group with N - 1 elements

n                     number of elements in the vector X

N                     a positive integer, the number of memory modules in an
                      array memory, or the number of registers in an array
                      register

p                     a positive integer, or the separation distance between
                      the ith and the (i + 1)th elements of a p-ordered
                      vector; p is also called the skip distance

P                     a prime number

pq                    two positive integers p and q to denote a pq-ordered
                      vector; p is the skip distance and q is the separation
                      distance

r                     primitive root

R                     number of processors

t                     order of g

TN                    transposition network

v                     a positive integer equal to $2^i$

X                     a one-dimensional vector consisting of n elements

$X_i$                 the ith element of an n-element vector X

$\in$                 contained in

$\lceil x \rceil$     smallest integer greater than or equal to x (ceiling
                      of x)

$\lfloor x \rfloor$   largest integer less than or equal to x (floor of x)


NASF TRANSPOSITION NETWORK: 


A COMPUTING NETWORK FOR UNSCRAMBLING p-ORDERED VECTORS 

Raymond S. Lim 
Ames Research Center 


SUMMARY 


This paper presents a tutorial description of a transposition network
(TN) proposed by the Burroughs Corporation for the Numerical Aerodynamic
Simulation Facility (NASF). The description is presented from the viewpoints
of design, programming, and application. The TN is a programmable
combinational logic network that connects 521 memory modules to 512
processors, where gcd(521,512) = 1. The primary purpose of the TN is to
transpose (or unscramble) p-ordered vectors to 1-ordered vectors in one cycle.
For unscrambling pq-ordered vectors, the TN speed degrades to several cycles.
The TN design, which evolved from the Swanson network, is based upon the
concept of cyclic groups from abstract algebra and primitive roots and indices
from number theory. The design can be implemented by one level of barrel
switch plus a fixed wiring pattern and its inverse. The connection of this
fixed wiring pattern is from p to m according to $k^m \equiv p \pmod{N}$,
where k is a primitive root of the prime N, m is the index of p relative to
k, and p is an element of the cyclic group of order N - 1 generated by k.
The programming of the TN is very simple, requiring only 20 bits: 10 bits for
offset control and 10 bits for barrel switch shift control. This simple
control is executed by the control unit (CU), not the processors. For this
reason, any memory access by a processor must be coordinated with the CU and
must wait for all other processors to come to a synchronization point. These
wait and synchronization events can degrade the performance of a computation.
The TN is applicable to multidimensional data manipulation, matrix
processing, and data sorting, and can also perform a perfect shuffle. Unlike
other more complicated and powerful permutation networks, the TN cannot
unscramble non-p-ordered vectors in one cycle, if it can unscramble them at
all.


I. INTRODUCTION 


In the preliminary studies (refs. 1, 2), the Burroughs Corporation pro- 
posed a baseline computer system for the flow model processor (FMP) of the 
NASF. This proposed computer system is similar to the ILLIAC IV (ref. 3) 
parallel computer except that a few innovative ideas were incorporated. One 
of these innovations, the TN, is a bidirectional programmable combinational 
logic network that can be used to perform conflict-free access to various 
slices of multidimensional data (such as rows, columns, diagonals, and files) 
and subsequent transposition of these data for processing. The end result is
that the TN provides a very simple and efficient method for the FMP to store
its data in memory and allows parallel algorithms to be executed at a rate
matched to the array processor rate.

The TN is one method of solving the traditional memory-processor connec- 
tion problem in parallel array processors, but there are others (refs. 4 
through 7). The tradeoffs in the design of memory-processor connection net- 
works are basically complexity versus flexibility. A full crossbar switch, 
which is very complex and costly, can unscramble any data permutation and thus 
allow the system to store data without regard to subsequent unscrambling 
requirements. The TN, which is a low-cost and simple network, cannot unscram- 
ble every data permutation in one cycle. Therefore, because several cycles 
are needed to unscramble the data, certain data storage allocations can result 
in processing delay. 

In the FMP, the TN connects an array of N = 521 memory modules, called 
the extended memory (EM), to an array of R = 512 processors, where N is 
selected as the smallest prime number greater than R. This memory-processor 
connection is shown in figure 1, which is a block diagram of the FMP. For 
certain types of data storage allocations in memory, such as the IBM FORTRAN IV 
method that requires that multidimensional data be stored in column major 
order, p-ordered vectors arise naturally from these storage allocations. A 
p-ordered vector, as defined by Swanson (ref. 8), is an n-element vector in
which the (i + 1)th element is spaced p positions to the right of the
ith element modulo N. The primary purpose of the TN is to transpose (or
unscramble) p-ordered vectors to 1-ordered vectors in one network cycle. For
unscrambling quasi-p-ordered vectors (called pq-ordered vectors), the TN
speed degrades to several network cycles.

Figure 1.- Block diagram of FMP showing memory-processor connection using TN.







The design concept of the TN is based upon Swanson’s paper (ref. 8) concerning
the use of k-apart interconnection networks to unscramble p-ordered vectors.
In his paper, Swanson shows that a p-ordered vector can be unscrambled to a
1-ordered vector by using a k-apart network requiring $m \le N - 2$ routings
if k is a primitive root of the prime N, m is the index of p relative to k,
and p is an element of the cyclic group generated by k. G. Barnes of
the Burroughs Corporation (ref. 1) observed that these N - 2 routings can be
reduced to $L = \log_2 N$ routings by using L levels of $k^v$-apart networks,
where $v = 2^i$ and i = 0,1,2, . . ., L - 1. Furthermore, he observed that
these L networks can be implemented by one level of barrel switch plus a fixed
wiring pattern and its inverse. The connection of this fixed wiring pattern
is from p to m according to $k^m \equiv p \pmod{N}$. The result is a very
high-speed and low-cost TN, but with limited performance when compared to
other permutation networks presently known. The programming of the TN is
simple, requiring only 20 bits: 10 bits for offset control and 10 bits for
barrel switch shift control.

In this paper, a tutorial description of the TN is presented from the 
viewpoints of design, programming, and application in seven sections. Sec- 
tion II gives a brief description of the TN in the FMP and a description of 
p-ordered and pq-ordered vectors. Section III reviews cyclic groups from 
abstract algebra and primitive roots and indices from number theory. Sec- 
tion IV gives a summary of Swanson's paper. Section V describes the TN design 
using $\log_2 N$ k-apart interconnection networks. Section VI describes the TN
design using one level of barrel switch. Finally, section VII describes TN 
programming and application. 


II. p AND pq-ORDERED VECTORS 


This section begins with a brief description of the TN and the FMP, and 
then shows how two- and three-dimensional data might be stored in such an 
architecture. There are many methods for storing data in multimodule memory 
systems (refs. 9 through 11). In this paper, only the FORTRAN column major 
order method is of interest because the FMP will use an extended FORTRAN as 
its high-level programming language. For the column major order method, it 
will be seen that p-ordered and pq-ordered vectors arise naturally from the 
FMP storage structure. The definition of p-ordered and pq-ordered vectors 
depends on the number of memory modules, and it is found that there are some 
advantages in storing data and unscrambling fetched vectors when the number 
of memory modules is restricted to be a prime number. 

The FMP block diagram is shown in figure 1. This FMP architecture is 
similar to the ILLIAC IV except that a TN is used to connect 521 memory mod- 
ules to 512 processors. The exact interconnection pattern is programmable and 
is controlled by the CU. There are no connections among the 512 processors, 
and any data exchanges between processors must be done by way of the EM using 
the TN. Once the CU sets up a particular interconnection pattern in the TN, a 
single read (or write) instruction by each processor will fetch (or store) a 
word from (to) its connected EM module. The net result is that a vector of
data can be fetched (or stored) in one memory access. The fetched vector from
EM, prior to the TN unscramble, can be a p-ordered, a pq-ordered, or any
permuted vector. In the following, the concept and the definition of p-ordered
and pq-ordered vectors are described. Instead of using R = 512 and N = 521
for the description, a simpler model consisting of R = 8 and N = 11 will be
used. Note that 11 is the smallest prime number greater than 8.

Let $X = (X_0\ X_1\ X_2\ \ldots\ X_{n-2}\ X_{n-1})$ be a one-dimensional
vector, or simply just a vector, with n elements. Let $X_i$,
i = 0,1,2, . . ., n - 1, denote the ith element of X. In a computer, X is
normally stored in N registers. In this case, one can say that associated
with $X_i$ is a position number j, j = 0,1,2, . . ., N - 1, where $N \ge n$.
For N = n, $X_i$ is assigned as follows:

   j:    0    1    2    3    4    . . .  N-2      N-1

   X_i:  X_0  X_1  X_2  X_3  X_4  . . .  X_{n-2}  X_{n-1}

Swanson defines a p-ordered vector X as an n-element vector, where
$X_{i+1}$ is spaced (or skipped) p positions to the right of $X_i$ modulo N.
In this paper, p is called the skip distance of X. In view of this definition,
the above assignment of $X_i$ to j is a 1-ordered vector because $X_{i+1}$ is
spaced one position to the right of $X_i$ modulo N. Other examples of
p-ordered vectors are illustrated below:

Example 2-1. N = 11, n = 11, p = 3, offset = 0

   j:    0    1    2    3    4    5    6    7    8     9    10

   X_i:  X_0  X_4  X_8  X_1  X_5  X_9  X_2  X_6  X_10  X_3  X_7

Example 2-2. N = 11, n = 8, p = 5, offset = 2, * = don’t care

   j:    0    1    2    3    4    5    6    7    8    9    10

   X_i:  X_4  X_2  X_0  *    X_7  X_5  X_3  X_1  *    *    X_6

In the two examples above, the offset is defined as the location of the
first element ($X_0$) of the n-element vector with respect to the first
position of the register (or memory module) array. With these examples,
Swanson’s definition (ref. 8, p. 1107) for a p-ordered vector can be stated.

Definition: Let j = 0,1,2, . . ., N - 1, p = 1,2,3, . . ., N - 1,
            i = 0,1,2, . . ., n - 1, $N \ge n$, and offset = 0. An
            n-element vector X is p-ordered if the positions j of
            its elements $X_i$ are described by

                    $pi \bmod N = j$                                  (2-1)

From linear congruence in number theory, equation (2-1) can be written as

                    $pi \equiv j \pmod{N}$                            (2-2)

and there exists a unique solution for the unknown i for all values of j
if and only if p is relatively prime to N (ref. 12, p. 84). Note that if
N is prime, then all p-ordered vectors can be defined for
p = 1,2, . . ., N - 1. In the FMP, there are 512 processors. Therefore, the
number of EM modules N is selected to be 521, which is the smallest prime
number greater than 512. As a result, there is more flexibility in the manner
in which the data can be stored, since all p-ordered vectors can be
unscrambled.
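
The placement rule of equation (2-1) can be tried out with a short program.
The following Python sketch is illustrative only and is not part of the TN
design: the function names are hypothetical, and the handling of a nonzero
offset, j = (offset + pi) mod N, is an assumption read off example 2-2 rather
than a formula stated in the definition.

```python
from math import gcd


def p_ordered_layout(n, N, p, offset=0):
    """Return a list of N slots; slot j holds i when X_i sits at position j.

    Unused positions ("don't care") are None. The offset rule
    j = (offset + p*i) mod N is an assumption taken from example 2-2.
    """
    if not 1 <= n <= N:
        raise ValueError("need 1 <= n <= N")
    if gcd(p, N) != 1:
        # Equation (2-2) has a unique solution only when gcd(p, N) = 1.
        raise ValueError("p must be relatively prime to N")
    slots = [None] * N
    for i in range(n):
        slots[(offset + p * i) % N] = i
    return slots


def unscramble(slots, p, offset=0):
    """Read a p-ordered layout back in 1-ordered sequence X_0, X_1, ..."""
    N = len(slots)
    n = sum(s is not None for s in slots)
    return [slots[(offset + p * i) % N] for i in range(n)]


if __name__ == "__main__":
    print(p_ordered_layout(11, 11, 3))            # example 2-1
    print(p_ordered_layout(8, 11, 5, offset=2))   # example 2-2
```

Choosing N prime makes the gcd test pass for every p from 1 to N - 1, which
is the flexibility referred to above.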

Now that a p-ordered vector is defined, let’s examine how p-ordered vectors
arise naturally from data stored in memory modules. Consider the storage
of the 4 x 4 x 4 three-dimensional data set of figure 2 in N = 11 memory
modules, as shown in figure 3. In figure 2, an element in a plane, say
$a_{231}$, is simply written as 231. In figure 3, $M_j$ is used to denote the
memory modules for j = 0,1,2, . . ., 10, and i is used to denote the memory
address within a module. Also, assume that the computer system has eight
processors and all are assigned to compute on this data set. The memory
accessing requirement by each processor will be described later in
section VII, whereas in this section, only the illustration of p-ordered and
pq-ordered vectors (as defined later) is of interest.

The natural formation of p-ordered vectors can be seen when the processors
are computing in the K direction. In this direction, data from the
IJ planes, one plane at a time, are required. As can be seen from figures 2
and 3, fetching any column, or any two consecutive columns i and i + 1,
yields a 1-ordered vector; only the offset with respect to $M_0$, the first
memory module, differs. This is the case because the data are stored in
column major order. As an example, the fetched vector consisting of columns 3
and 4 from plane 1 is a 1-ordered vector with offset = 8 (the leading element
is 131) and is shown below:

   M_j:  0    1    2    3    4    5    6    7    8    9    10

   X_i:  431  141  241  341  441  *    *    *    131  231  331

Figure 2.- A 4 x 4 x 4 three-dimensional data set.

Figure 3.- Storage of a 4 x 4 x 4 data set in 11 memory modules; $M_j$ is the
memory module number and i is a location within a memory module.

The fetched vector of any row in any plane is a 4-ordered vector because
the number of elements in a column is four, which is the separation distance
of the row elements in a column major order storage. As an example, the
fetched vector of row 2 in plane 1 is a 4-ordered vector with offset = 1 (the
leading element is 211) as shown below:

   M_j:  0    1    2    3    4    5    6    7    8    9    10

   X_i:  *    211  241  *    *    221  *    *    *    231  *     row 2

   X_i:  111  141  *    *    121  *    *    *    131  *    *     row 1

However, if an attempt is made to fetch both rows 1 and 2 from plane 1
simultaneously, there is a memory access conflict in memory module $M_1$
because elements 211 and 141 are both stored in $M_1$. In this case, the row
vector consisting of rows 1 and 2 must be formed by two memory fetches through
the TN. In general, a memory access conflict exists if the memory module
addresses of two elements are equal. Section VII gives equations for
calculating element addresses in a three-dimensional data set and also
describes how the skip distance p can be calculated.
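
The module assignments used in the examples above can be reproduced with a
small program. The Python sketch below is illustrative only: the names are
hypothetical, and the address rule (ordinary column major linear address,
reduced modulo N to obtain the module number) is an assumption chosen because
it matches the module numbers quoted in this section; the equations actually
used by the FMP are those of section VII.

```python
N_MODULES = 11          # N in the simplified model
DIMS = (4, 4, 4)        # extent of the I, J, and K dimensions


def linear_address(i, j, k, dims=DIMS):
    """Column major linear address of element ijk (indices start at 1)."""
    I, J, _ = dims
    return (i - 1) + I * (j - 1) + I * J * (k - 1)


def module_of(i, j, k, n_modules=N_MODULES):
    """Memory module number M_j assumed to hold element ijk."""
    return linear_address(i, j, k) % n_modules


def fetch_conflicts(elements, n_modules=N_MODULES):
    """List (first_element, second_element, module) for every module hit twice."""
    seen = {}
    conflicts = []
    for e in elements:
        m = module_of(*e, n_modules)
        if m in seen:
            conflicts.append((seen[m], e, m))
        else:
            seen[m] = e
    return conflicts


def row(r, plane):
    """Elements of row r in IJ plane number `plane`."""
    return [(r, j, plane) for j in range(1, DIMS[1] + 1)]


if __name__ == "__main__":
    print([module_of(*e) for e in row(2, 1)])    # modules visited by row 2
    print(fetch_conflicts(row(1, 1) + row(2, 1)))
```

Under this assumed rule, successive elements of a row are four linear
addresses apart, which is why a row fetch appears as a 4-ordered vector.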


The concept of a pq-ordered vector is derived from the concept of a
p-ordered vector. The natural formation of pq-ordered vectors can be seen
when the processors are computing in the J direction. In this direction,
data from the IK planes, one plane at a time, are required. Referring to
figures 2 and 3, the fetched vector consisting of column 3 =
$(142\ 242\ 342\ 442)^T$ and column 4 = $(141\ 241\ 341\ 441)^T$ in plane 4
is a pq-ordered vector with p = 1, q = 2, and offset = 6, as shown below.
Also shown below is a pq-ordered vector with p = 7 and q = 5.

   M_j:  0    1    2    3    4    5    6    7    8    9    10     p  q

   X_i:  *    141  241  341  441  *    142  242  342  442  *      1  2

   X_i:  431  422  *    411  441  432  *    421  412  442  *      7  5

This second vector is a result of fetching column 1 =
$(441\ 431\ 421\ 411)^T$ and column 2 = $(442\ 432\ 422\ 412)^T$ from plane 1
when computing in the I direction. For unscrambling pq-ordered vectors, the
TN cannot perform the task in one cycle. In the above examples, two memory
fetches and unscramblings are required. In general, the total number of
memory fetches and unscramblings is equal to the number of groups within the
vector X. Each group within X is a p-ordered vector, and the separation
distance between groups is q. For this reason, pq-ordered vector fetching
sometimes is called periodic fetching by groups.

The concept and natural formation of p-ordered and pq-ordered vectors
have been presented in this section. The 4 x 4 x 4 three-dimensional data
set in figure 2 and its memory module storage allocation scheme shown in
figure 3 are only for illustrative purposes. In actual practice, the data
set and the storage allocation scheme may be different. The purpose here is
to conjecture that the TN, as described later in the design sections, can be
used to unscramble a p-ordered vector to a 1-ordered vector in one cycle and
to unscramble a pq-ordered vector to a 1-ordered vector in several cycles.
The exact number of cycles is dependent on such factors as data storage
allocation schemes, memory access conflicts, and the number of periodic groups
within a vector.


III. CYCLIC GROUPS, PRIMITIVE ROOTS, AND INDICES 


In the design of the TN, the concepts of cyclic groups, primitive roots,
and indices are used. For this reason, a brief review of the parts of these
concepts needed to understand the TN design is presented. In summary, a
primitive root in number theory is a special case of a generator of a cyclic
group in abstract algebra. The theory of primitive roots and indices can be
used to solve certain types of congruence equations in number theory. Readers
who wish to do so can skip this section and proceed to the next section,
where the description of the Swanson network is presented.


Cyclic Groups 

From abstract algebra, a set is a collection of elements with some com- 
mon properties that can be used to determine whether an element does or does 
not belong to the set. A group G is a nonempty set of elements and a binary 
operation on the set (i.e., the set is closed under the group operation) such 
that the operation is associative, there is a unique identity element for the 
operation, and each element has an inverse element for the operation. If the 
operation is commutative, then G is called a commutative (or abelian) group. 
If the number of elements in G is finite, then G is called a finite group. 
Otherwise, G is called an infinite group. The number of elements contained 
in G is called the order of G. The binary operation of G is often 
written as multiplication. If the operation is written as multiplication, 
then G is called a multiplicative group. In the design of the TN, only 
finite multiplicative groups will be used. 

Let g be an element of G and e be the identity element of G. The
order of g is the smallest positive integer t such that $g^t = e$. Note
that if G is finite, every element g has an order and the order of g
divides the order of G.

A group G is called cyclic if it contains an element g such that
every element h of G can be expressed as

                    $h = g^{\ell}$                                    (3-1)

for some integral exponent $\ell$, positive, negative, or zero. Such an
element g of a cyclic group G is called a generator of G. Note that g is a
generator of G (and G is cyclic) if and only if the order of g is equal
to the order of G. If G has N - 1 elements generated by g modulo N,
then the generator g is called a primitive element of G, or a primitive
generator. The following examples will illustrate the concept.


Example 3-1. The number 3 is a generator of the cyclic group
             G = (1,2,3,4,5,6) under multiplication modulo 7, since
             $3^1 \equiv 3$, $3^2 \equiv 2$, $3^3 \equiv 6$, $3^4 \equiv 4$,
             $3^5 \equiv 5$, $3^6 \equiv 1$

Example 3-2. The number 2 is a generator of the cyclic group
             G = (1,2,3,4,5,6,7,8,9,10) under multiplication modulo 11,
             since $2^1 \equiv 2$, $2^2 \equiv 4$, $2^3 \equiv 8$,
             $2^4 \equiv 5$, $2^5 \equiv 10$, $2^6 \equiv 9$, $2^7 \equiv 7$,
             $2^8 \equiv 3$, $2^9 \equiv 6$, $2^{10} \equiv 1$

Example 3-3. The number 4 is a generator of the cyclic group
             G = (1,2,4) under multiplication modulo 7, since
             $4^1 \equiv 4$, $4^2 \equiv 2$, $4^3 \equiv 1$
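
The power lists in examples 3-1 through 3-3 can be regenerated mechanically.
The following Python sketch is illustrative only; the function name
`generated_subgroup` is hypothetical and simply multiplies by g repeatedly
until the identity element 1 is reached.

```python
from math import gcd


def generated_subgroup(g, N):
    """Return [g^1, g^2, ..., g^t] modulo N, where t is the order of g."""
    if gcd(g, N) != 1:
        # Without this condition the powers never return to 1.
        raise ValueError("g must be relatively prime to N")
    powers = []
    x = 1
    while True:
        x = (x * g) % N
        powers.append(x)
        if x == 1:
            return powers


if __name__ == "__main__":
    print(generated_subgroup(3, 7))    # example 3-1
    print(generated_subgroup(2, 11))   # example 3-2
    print(generated_subgroup(4, 7))    # example 3-3
```

The length of the returned list is the order of g, so g generates the whole
group of N - 1 elements exactly when that length equals N - 1.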


Let N be a positive integer and let $M_N$ consist of all the positive
integers that are less than N and relatively prime to N. Then $M_N$ forms
a group under the binary operation multiplication modulo N. If N is
prime, then $M_N$ contains all the integers from 1 to N - 1 and, therefore,
has order N - 1. Furthermore, if N is prime, $M_N$ is cyclic and thus has
at least one generator. In examples 3-1 and 3-2 above, $M_7$ and $M_{11}$ are
two such cyclic groups generated by 3 and 2, respectively. In the TN design,
N is chosen to be 521 and 3 (the smallest positive integer generating
$M_{521}$ and also a primitive root of 521 as defined later) is chosen as the
primitive generator to generate all the integers from 1 to 520 modulo 521.
This is to ensure that all p-ordered vectors, for p = 1,2, . . ., 520, can be
unscrambled in one cycle. The design of the Swanson network is based upon the
concept of cyclic groups and linear congruences as described later.


Primitive Roots 

The concept of primitive roots is a special case of the concept of cyclic 
group generators. Let a group G have N - 1 elements. If g of G is a 
primitive generator, then g is a primitive root of N. As previously 
described, the discussion of cyclic groups is a topic in group theory in 
abstract algebra (refs. 13 and 14), whereas the discussion of primitive roots 
is a topic in number theory (refs. 12, 15, and 16). A generator in G is 
one that generates a cyclic group (or subgroup) of G. The generator that 
generates a cyclic group of order N - 1 modulo N is called a primitive
element (or primitive root) of G. Thus, primitive generators and primitive
roots are the same. In this section, the concept of primitive roots from
number theory is briefly reviewed, starting with Euler’s theorem.

In number theory (refs. 12, 15, and 16), Euler’s phi-function $\phi(N)$, for
N > 1, denotes the number of positive integers not exceeding N that are
relatively prime to N. If N is a prime number, then every positive integer
less than N is relatively prime to it; whence $\phi(N) = N - 1$. On the other
hand, if N > 1 is composite, then N has a divisor d such that 1 < d < N.
It follows that there are at least two integers among 1,2,3, . . ., N that
are not relatively prime to N, namely, d and N itself. As a result,
$\phi(N) \le N - 2$. This proves: for N > 1,

          $\phi(N) = N - 1$ if and only if N is prime                 (3-2)

If N is not prime, the value of $\phi(N)$ can be obtained either
analytically or by table lookup (ref. 17). In reference 17 (pp. 840 to 843),
$\phi(N)$ is listed for N ranging from 1 to 1000. One useful property of
$\phi(N)$ is given below without proof.

Euler’s theorem: If a and N are positive integers and gcd (a,N) = 1,
                 then $a^{\phi(N)} \equiv 1$ modulo N.

In view of Euler’s theorem, it is known that $a^{\phi(N)} \equiv 1$ modulo N
whenever gcd (a,N) = 1. However, there are often powers of a smaller than
$a^{\phi(N)}$ that are congruent to 1 modulo N. This leads to the definition
of the order of a modulo N (in older terminology: the exponent to which a
belongs modulo N):

Order definition: Let N > 1 and gcd (a,N) = 1. The order of
                  a modulo N is the smallest positive integer k
                  such that $a^k \equiv 1$ modulo N.

As an example, consider the successive powers of 2 modulo 7 as follows:

   $2^1 \equiv 2$, $2^2 \equiv 4$, $2^3 \equiv 1$, $2^4 \equiv 2$,
   $2^5 \equiv 4$, $2^6 \equiv 1$, . . .

from which it follows that the integer 2 has order 3 modulo 7. This example
also shows that $2^k \equiv 1$ modulo 7 whenever k is a multiple of 3. This
leads to a theorem for finding the order of an integer a.

Order theorem: Let the positive integer a have order k modulo N.
               Let h > k, then $a^h \equiv 1$ modulo N if and only if k
               divides h; in particular, k divides $\phi(N)$.

This theorem expedites the computation of the order of an integer a modulo N.
Instead of considering all powers of a, the exponents can be restricted to
the divisors of $\phi(N)$.
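
The shortcut offered by the order theorem can be expressed in a few lines.
The Python sketch below is illustrative only; `phi` and `order` are
hypothetical helper names, and `phi` counts relatively prime integers
directly rather than using a table such as the one in reference 17.

```python
from math import gcd


def phi(N):
    """Euler's phi-function by direct counting (adequate for small N)."""
    return sum(1 for a in range(1, N + 1) if gcd(a, N) == 1)


def order(a, N):
    """Order of a modulo N, testing only exponents that divide phi(N)."""
    if N <= 1 or gcd(a, N) != 1:
        raise ValueError("need N > 1 and gcd(a, N) = 1")
    f = phi(N)
    for k in range(1, f + 1):
        # By the order theorem, the order must be a divisor of phi(N).
        if f % k == 0 and pow(a, k, N) == 1:
            return k
    raise AssertionError("unreachable: Euler's theorem guarantees k = phi(N) works")


if __name__ == "__main__":
    print(phi(7), order(2, 7))      # the integer 2 has order 3 modulo 7
    print(phi(11), order(2, 11))
```

The built-in three-argument `pow(a, k, N)` computes the power modulo N
without forming the full integer $a^k$.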


The order theorem leads directly to the definition of primitive
roots. Note that the order k of an integer a modulo N is necessarily a
divisor of $\phi(N)$. The largest of these divisors is $\phi(N)$ itself.
Integers of order $\phi(N)$ modulo N are called primitive roots of N. Note
that in general there may be no primitive roots of N. However, in the case
that N is a prime, primitive roots of N always exist. The formal definition
of primitive root is stated as follows:

Primitive root definition: If gcd (a,N) = 1 and a is of order $\phi(N)$
                           modulo N, then a is a primitive root of N.

In other words, N has a as a primitive root if $a^{\phi(N)} \equiv 1$
modulo N, but $a^k \not\equiv 1$ modulo N for all positive integers
$k < \phi(N)$. Thus, a primitive root of the positive integer N is an integer
that has the largest possible order. This means that the powers of a
primitive root modulo N generate all the positive integers less than N that
are relatively prime to N. From this analysis, it can be seen that primitive
roots are a special case of (primitive) generators of cyclic groups. As
indicated earlier in example 3-2 (cyclic groups), 2 is a primitive generator
for the cyclic group G = (1,2,3,4,5,6,7,8,9,10) modulo 11. In number theory,
2 is a primitive root of 11 because $\phi(11) = 10$, $2^{10} \equiv 1$
modulo 11, $2^k \not\equiv 1$ modulo 11 for k < 10, and thus the powers of 2
generate all the elements of G.


Primitive roots do not exist for every integer. It is known that a positive
integer N > 1 has primitive roots if and only if N is 2, 4, a power of an odd
prime, or twice a power of an odd prime. Also, there is no known simple and
efficient algorithm to find the primitive roots other than by the use of the
definition. In general, if a is a primitive root of N, then any other
primitive root of N is found among the members of the set
$(a, a^2, \ldots, a^{\phi(N)})$. When primitive roots do exist, their exact
number is given by the following theorems, starting with the congruence
theorem.

Congruence theorem: Let gcd (a,N) = 1 and let
                    $a_1, a_2, \ldots, a_{\phi(N)}$ be the positive integers
                    less than N and relatively prime to N. If a is a
                    primitive root of N, then $a, a^2, \ldots, a^{\phi(N)}$
                    are congruent modulo N to
                    $a_1, a_2, \ldots, a_{\phi(N)}$ in some order.

The congruence theorem leads immediately to the next two theorems.

Primitive root theorem: If N has a primitive root, then it has
                        exactly $\phi(\phi(N))$ of them.

Prime primitive root theorem: If N is a prime, then there are exactly
                              $\phi(N - 1)$ incongruent primitive roots of
                              N.

The above theorems will be illustrated by two examples.

Example 3-4. Let N = 11. There are exactly $\phi(10) = 4$ primitive roots
             of 11. The smallest primitive root of 11 is 2 (by table
             lookup). The other three primitive roots of 11 must be
             among the members of the set $2^i$, i = 1,2, . . ., 10
             modulo 11 as shown below:

             $2^1 \equiv 2$, $2^2 \equiv 4$, $2^3 \equiv 8$, $2^4 \equiv 5$,
             $2^5 \equiv 10$, $2^6 \equiv 9$, $2^7 \equiv 7$, $2^8 \equiv 3$,
             $2^9 \equiv 6$, $2^{10} \equiv 1$

             By tedious calculation, these three primitive roots of 11
             are 6, 7, and 8. The 10 powers of the primitive root 2
             presented above form the cyclic group $M_{11}$ and therefore
             are congruent in some order to the $\phi(11)$ numbers in
             the set (1,2,3, . . ., 10).

Example 3-5. Let N = 9, which is not a prime number. There are exactly
             $\phi(\phi(9)) = \phi(6) = 2$ primitive roots of 9. The smallest
             primitive root of 9 is 2 (by table lookup). The other
             primitive root can be found among the six powers of
             2 modulo 9 as shown below:

             $2^1 \equiv 2$, $2^2 \equiv 4$, $2^3 \equiv 8$, $2^4 \equiv 7$,
             $2^5 \equiv 5$, $2^6 \equiv 1$

             By tedious calculation, this primitive root of 9 is 5.
             Now the integers less than and relatively prime to 9 are
             1, 2, 4, 5, 7, and 8, and these numbers are congruent in
             some order to the six powers of 2 presented above.
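
The "tedious calculation" in examples 3-4 and 3-5 is easily mechanized for
small N. The Python sketch below is illustrative only: it applies the
primitive root definition directly, and the function name `primitive_roots`
is hypothetical.

```python
from math import gcd


def primitive_roots(N):
    """Return all primitive roots of N (in the range 1..N-1) by the definition."""
    units = [a for a in range(1, N) if gcd(a, N) == 1]
    f = len(units)                 # this count is phi(N) for N > 1
    roots = []
    for a in units:
        # Find the order of a by stepping through its successive powers.
        x, k = a % N, 1
        while x != 1:
            x = (x * a) % N
            k += 1
        if k == f:                 # order equals phi(N): a is a primitive root
            roots.append(a)
    return roots


if __name__ == "__main__":
    print(primitive_roots(11))     # example 3-4
    print(primitive_roots(9))      # example 3-5
    print(primitive_roots(8))      # an N with no primitive roots
```

The number of roots found agrees with the primitive root theorem whenever the
list is nonempty; for N = 8 the list is empty because 8 is not of the form
2, 4, a power of an odd prime, or twice a power of an odd prime.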

In the TN design, a prime number N = 521 is selected. There are
$\phi(520) = 192$ primitive roots to choose from as a primitive generator to
generate the cyclic group $M_{521}$ = (1,2, . . ., 520). In practice, the
smallest primitive root is selected as the generator. For N = 521, 3 is the
smallest primitive root. Tables of the smallest primitive root r of the prime
number p can be found in references 12 and 17. In reference 12 (p. 327),
r is given for 2 < p < 1000. In reference 17 (pp. 864 to 869), r is given
for $3 < p \le 9973$. Other than these tables, there are no known algorithms
today, for example, to find the 192 primitive roots of N = 521 other than
essentially using the stated definition. It is to be noted, however, that
2 is not a primitive root of N = 521 because $2^{260} \equiv 1$ modulo 521,
which violates the primitive root definition, since $260 < \phi(521) = 520$.
A method of calculating $2^{260}$, an old and undocumented method in number
theory, is shown in appendix A. As will be described later, this method also
will be used as the first high-speed implementation method of the Swanson
network.
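
The statements about N = 521 can be checked numerically. The Python sketch
below is illustrative only. It uses ordinary repeated squaring for the
modular power, which is a standard technique and is not claimed to be the
method of appendix A; the names `mod_pow` and `is_primitive_root_521` are
hypothetical. The primitive root test relies on the order theorem: since
520 = 2^3 x 5 x 13, an integer a has order 520 exactly when none of
a^(520/2), a^(520/5), a^(520/13) is congruent to 1 modulo 521.

```python
N = 521
PRIME_FACTORS_OF_520 = (2, 5, 13)   # 520 = 2**3 * 5 * 13


def mod_pow(base, exp, mod):
    """Compute base**exp modulo mod by repeated squaring."""
    result = 1
    base %= mod
    while exp > 0:
        if exp & 1:                  # current binary digit of exp is 1
            result = (result * base) % mod
        base = (base * base) % mod   # square for the next binary digit
        exp >>= 1
    return result


def is_primitive_root_521(a):
    """True if a has order 520 modulo 521."""
    if a % N == 0:
        return False
    return all(mod_pow(a, 520 // q, N) != 1 for q in PRIME_FACTORS_OF_520)


if __name__ == "__main__":
    print(mod_pow(2, 260, N))        # equals 1, so 2 is not a primitive root
    roots = [a for a in range(1, N) if is_primitive_root_521(a)]
    print(roots[0], len(roots))      # smallest primitive root and their count
```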


Theory of Indices 

The theory of indices is an old and neglected topic in number theory. It
is analogous to that of logarithms, where the primitive root plays the part
similar to that of a base of a logarithm. The major difference is that
indices are computed using the theory of congruences. In the TN design, the
barrel switch implementation of the Swanson network can be described most
simply by using the theory of indices.

Let N be a positive integer. Two integers a and b are said to be
congruent modulo N, symbolized by

                    $a \equiv b \pmod{N}$                             (3-3)

if N divides the difference a - b. Let r be a primitive root of N.
From the congruence theorem, the first $\phi(N)$ powers of r,

                    $r, r^2, \ldots, r^{\phi(N)}$

are congruent modulo N, in some order, to those integers less than N and
relatively prime to it. Hence, if a is an arbitrary integer relatively
prime to N, then a can be expressed in the form

                    $a \equiv r^k \pmod{N}$                           (3-4)

for a suitable choice of k, where $1 \le k \le \phi(N)$. The exponent k is
called the index of a relative to r, and this leads to the formal definition
of the concept of index.

Index definition: Let r be a primitive root of N. If gcd (a,N) = 1,
                  then the smallest positive integer k such that
                  $a \equiv r^k$ modulo N is called the index of a relative
                  to r.

The standard notation for the index of a relative to r is
$\operatorname{ind}_r a$ or, if no confusion is likely to occur, ind a is
used. Clearly, $1 \le \operatorname{ind} a \le \phi(N)$ and

                    $r^{\operatorname{ind} a} \equiv a \pmod{N}$      (3-5)

Note that the definition of index is meaningless unless gcd (a,N) = 1.
The following example will illustrate the concept of index.


Example 3-6. The integer 2 is a primitive root of 11 and the 10 powers
             of 2 are listed in example 3-4. From this list a table
             of indices can be prepared as follows:

             a:        1   2   3   4   5   6   7   8   9   10

             ind_2 a:  10  1   8   2   4   9   7   3   6   5

             It follows that

             $\operatorname{ind}_2 1 = 10$, $\operatorname{ind}_2 2 = 1$,
             . . ., $\operatorname{ind}_2 10 = 5$

With indices, as with logarithms, multiplication, division, involution,
and evolution are replaced by addition, subtraction, multiplication, and
division, respectively. These concepts are summarized in the following
theorem without proof.

Index theorem 1: If N has a primitive root r and ind a denotes the
                 index of a relative to r, then

                 (1) $\operatorname{ind}(ab) \equiv \operatorname{ind} a + \operatorname{ind} b$ modulo $\phi(N)$

                 (2) $\operatorname{ind} a^k \equiv k \operatorname{ind} a$ modulo $\phi(N)$ for k > 0

                 (3) $\operatorname{ind} 1 \equiv 0$ modulo $\phi(N)$, $\operatorname{ind} r \equiv 1$ modulo $\phi(N)$
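
A table of indices is simply the list of powers of r read backwards: the
power $r^k$ is the table entry whose index is k. The Python sketch below is
illustrative only; `index_table` is a hypothetical name, and $\phi(N)$ is
passed in by the caller rather than computed.

```python
def index_table(r, N, phi_N):
    """Return a dict mapping each a (relatively prime to N) to ind_r a.

    Raises ValueError if r is not a primitive root of N, that is, if the
    first phi_N powers of r are not all distinct.
    """
    table = {}
    x = 1
    for k in range(1, phi_N + 1):
        x = (x * r) % N          # x is now r**k modulo N
        table[x] = k             # so the index of x relative to r is k
    if len(table) != phi_N:
        raise ValueError("r is not a primitive root of N")
    return table


if __name__ == "__main__":
    ind = index_table(2, 11, 10)
    print([ind[a] for a in range(1, 11)])   # second row of the table in example 3-6

    # Properties (1) and (2) of index theorem 1, checked exhaustively mod 11.
    for a in range(1, 11):
        for b in range(1, 11):
            assert (ind[a] + ind[b]) % 10 == ind[a * b % 11] % 10
        for k in range(1, 6):
            assert (k * ind[a]) % 10 == ind[pow(a, k, 11)] % 10
```

The same table, read as a mapping from a to ind a, has the form of the
relation $k^m \equiv p \pmod{N}$ quoted in the introduction for the fixed
wiring pattern from p to m.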

The theory of indices can be used to solve certain types of congruences.
For example, consider the binomial congruence

                    $X^k \equiv a$ modulo N,  k > 2                   (3-6)

where X is the unknown, a is an integer with gcd (a,N) = 1, and assume N
has r as a primitive root. By property (2) of index theorem 1, this
congruence is equivalent to the linear congruence

                    $k \operatorname{ind} X \equiv \operatorname{ind} a$ modulo $\phi(N)$

If $d = \gcd(k, \phi(N))$ and $d \nmid \operatorname{ind} a$, there is no
solution. But if $d \mid \operatorname{ind} a$, then there are exactly d
incongruent solutions. This leads to the following two theorems.

Index theorem 2 : Let N be an integer with primitive root r and 

gcd (a,N) = 1. Then the congruence X^ 1 = a modulo N 
has a solution if and only if 

a^(N)/d = i modulo N 

where d = gcd (k,(f>(N)); if it has a solution, there are 
exactly d solutions modulo N. 

Index theorem 3 : Let N be a prime and gcd (a,N) = 1. Then the con- 
gruence X^ 1 E a modulo N has a solution if and only if 

(N-l) / d _ . , - XT 

a =1 modulo N 

where d = gcd (k,N - 1) . 
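
The criterion of index theorem 3 can be compared with an exhaustive search for small N. The sketch below assumes N = 11; the names solvable and brute_force_roots are hypothetical helpers and are not part of the original report.

```python
from math import gcd

def solvable(k, a, N):
    """Index theorem 3 criterion for X^k ≡ a (mod N), N prime, gcd(a, N) = 1."""
    d = gcd(k, N - 1)
    return pow(a, (N - 1) // d, N) == 1

def brute_force_roots(k, a, N):
    """All X in 1 .. N-1 with X^k ≡ a (mod N), found by exhaustive search."""
    return [x for x in range(1, N) if pow(x, k, N) == a % N]

N = 11
for k in range(2, 11):
    d = gcd(k, N - 1)
    for a in range(1, N):
        roots = brute_force_roots(k, a, N)
        assert solvable(k, a, N) == (len(roots) > 0)
        assert len(roots) in (0, d)      # exactly d solutions when solvable

print(brute_force_roots(4, 3, 11))       # [4, 7]
```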

To conclude this section, two examples are given below to illustrate 
these concepts. 





Example 3-7. Solve the binomial congruence

             7X^3 ≡ 3 (mod 11)

             A table of indices using the primitive root r = 2 is
             presented in example 3-6. Here gcd (k,φ(N)) = gcd (3,10) = 1,
             and 1 divides ind_2 a = ind_2 3 = 8. So there is
             exactly one solution. This solution is found as follows:

             ind_2 7 + 3 ind_2 X ≡ ind_2 3 (mod 10)

             7 + 3 ind_2 X ≡ 8 (mod 10)

             3 ind_2 X ≡ 1 (mod 10)

             At this point X must be found by direct computation. By
             direct substitution of ind_2 X, for X = 1,2,3, . . ., 10,
             into the last congruence, only X = 7 (with ind_2 7 = 7)
             satisfies it. Thus X = 7 is the solution.

Example 3-8. Solve the binomial congruence

             3X^4 ≡ 9 (mod 11)

             Here d = gcd (4,10) = 2 and d | ind_2 9 = 6, so there are
             two solutions. Now,

             ind_2 3 + 4 ind_2 X ≡ ind_2 9 (mod 10)

             8 + 4 ind_2 X ≡ 6 (mod 10)

             4 ind_2 X ≡ -2 (mod 10)

                       ≡ 8 (mod 10)

             Since gcd (4,8,10) = 2, then

             2 ind_2 X ≡ 4 (mod 5)

             Since gcd (2,5) = 1, then

             ind_2 X ≡ 2 (mod 5)

             or

             ind_2 X ≡ 2, 7 (mod 10)

             The solutions are X = 4 and 7.
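
The two examples can be reproduced mechanically by working with indices. The following is a minimal sketch, assuming N is prime and r is a primitive root of N; the function names are hypothetical, and the indices are reduced to the range 0 to N - 2 (so that ind 1 = 0 rather than φ(N)), which is equivalent modulo φ(N).

```python
from math import gcd

def index_table(r, N):
    """{a: ind_r a} for 1 <= a < N, with indices reduced to 0 .. N-2."""
    return {pow(r, t, N): t for t in range(N - 1)}

def solve_binomial(c, k, a, N, r):
    """Solve c * X^k ≡ a (mod N) for prime N with primitive root r."""
    phi = N - 1
    ind = index_table(r, N)
    rhs = (ind[a % N] - ind[c % N]) % phi     # k * ind X ≡ rhs (mod phi)
    d = gcd(k, phi)
    if rhs % d != 0:
        return []                             # d does not divide rhs: no solution
    # Try every residue t for ind X; a linear scan is adequate for small N.
    ts = [t for t in range(phi) if (k * t - rhs) % phi == 0]
    return sorted(pow(r, t, N) for t in ts)

print(solve_binomial(7, 3, 3, 11, 2))   # [7]
print(solve_binomial(3, 4, 9, 11, 2))   # [4, 7]
```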
IV. SWANSON NETWORK


The Swanson network (ref. 8) is a simple interconnection network that can
be used to unscramble p-ordered vectors. Its design is based on a so-called
k-apart interconnection network, where k is a primitive root of N, and N
is the number of memory modules or the number of registers in a register array,
depending on the context of the description. Because of its simplicity, the
Swanson network requires m iterative routings to unscramble a p-ordered
vector. The value m is related to k and p by k^m ≡ p (mod N). If N is
prime, then at most m = N - 2 routings are required.


Let the n-element vector X, after it is fetched from a memory system
consisting of N modules, be stored in N registers, where obviously
n ≤ N. Let the elements of X be X_i, i = 0,1,2, . . ., n - 1. Let j be
the positional index of the N registers, j = 0,1,2, . . ., N - 1. The vector X is
p-ordered if the positions of its elements X_i are described by

    p i ≡ j (mod N)                                      (4-1)


Equation (4-1) is the same as equation (2-2) as described previously in
section II. A 5-ordered vector for n = 8 and N = 11 is shown in figure 4(a).
From this definition of p-ordered vector, it is known from number theory that
the linear congruence in equation (4-1) has a unique solution for i if and
only if gcd (p,N) = 1, and the equation can be solved by the theory of
primitive roots and indices. For this reason, Swanson proposed an interconnection,
called a k-apart interconnection network, which is defined as follows:

Definition. N registers are interconnected with a k-apart interconnection,
            k = 1,2, . . ., N - 1, if the content of register kj
            (mod N) can be transferred directly to register j. The
            notation for such an interconnection is

            Reg (kj mod N) → Reg (j)                     (4-2)


[Figure 4: register contents for j = 0, 1, . . ., 10. (a) 8-element 5-ordered
vector in registers; (b) after first routing; (c) after second routing;
(d) after third routing; (e) after fourth routing.]

Figure 4.- Unscrambling a 5-ordered vector with a 2-apart interconnection
           of 11 registers.


The problem confronted now is to determine the value of k. As with the
definition of p-ordered vectors, k must be restricted to be relatively
prime to N so that all p-ordered vectors can be unscrambled in, say, m
routings. When a vector contained in the registers is to be routed along the
interconnection paths, all registers transfer their content simultaneously.
If the registers contain a p-ordered vector with p = k, then one transfer
is sufficient to unscramble the vector to a 1-ordered vector. In order to
meet these requirements, k is necessarily a primitive root of N, as will be
described later. With this information, an interconnection network for N = 11
can be constructed using k = 2 since 2 is a primitive root of 11. From
equation (4-2), the calculation for the connections is shown in table 1, and
the physical connections are shown in figure 4(a). The unscrambling of a
5-ordered vector requires m = 4 routings, and this is calculated from

    2^m ≡ 5 (mod 11),  m = 4


TABLE 1.- INTERCONNECTION FOR 2-APART NETWORK (MOD 11)

     j     2j (mod 11)     Connection

     0          0            0 → 0
     1          2            2 → 1
     2          4            4 → 2
     3          6            6 → 3
     4          8            8 → 4
     5         10           10 → 5
     6          1            1 → 6
     7          3            3 → 7
     8          5            5 → 8
     9          7            7 → 9
    10          9            9 → 10


The step-by-step process of this unscrambling is shown in figure 4. It
is interesting to note how p changes after each routing. Since it requires
a total of four routings, the progression of p with respect to m is

    m = 0  1  2  3  4
    p = 5  8  4  2  1                                    (4-3)

which is a progression of the powers of k = 2, taken in descending order
(2^4, 2^3, 2^2, 2^1, 2^0 modulo 11). After the (m - 1)th routing,
the vector always reduces to k-ordered; thus, a p-ordered vector, with p = k,
can be unscrambled in one more routing.
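
The routing of equation (4-2) and the progression in equation (4-3) can be simulated directly. The following is a minimal sketch in which each register simply holds the subscript i of the element X_i it contains, and None marks an empty register; the function names are hypothetical.

```python
def p_ordered(n, p, N):
    """Registers holding an n-element p-ordered vector: X_i sits in register p*i mod N."""
    regs = [None] * N
    for i in range(n):
        regs[(p * i) % N] = i          # store the subscript i itself
    return regs

def route(regs, k):
    """One k-apart routing: Reg(k*j mod N) -> Reg(j), all registers at once."""
    N = len(regs)
    return [regs[(k * j) % N] for j in range(N)]

def current_order(regs):
    """The p for which the registers are p-ordered, read off from where X_1 sits."""
    return regs.index(1)

regs = p_ordered(n=8, p=5, N=11)
orders = [current_order(regs)]
for _ in range(4):
    regs = route(regs, k=2)
    orders.append(current_order(regs))

print(orders)   # [5, 8, 4, 2, 1]
print(regs)     # [0, 1, 2, 3, 4, 5, 6, 7, None, None, None]
```

Each routing multiplies the current order by the inverse of k modulo N, which is why the order steps down through the powers of k until it reaches 1.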


From the discussions of p-ordered vectors in section II and cyclic groups
in section III, N is necessarily a prime so that all p-ordered vectors, for
p = 1,2,3, . . ., N - 1, can be unscrambled. Since k is a primitive root of
N, the interconnections generated by equation (4-2) form a cyclic group
of order N - 1. The resultant interconnection will unscramble any
p-ordered vector in m routings according to

    k^m ≡ p (mod N)                                      (4-4)

With these discussions, a theorem can now be formulated.

Swanson theorem: If N registers are interconnected with a k-apart
                 interconnection, k ∈ M_N, then a p-ordered vector,
                 p ∈ M_N, contained in these registers can be converted
                 to a 1-ordered vector if and only if p is an element
                 of the cyclic group generated by k.

The proof of this theorem is simply the fact that after m routings through
the network, the contents of the registers are given by

    Reg [j] = (p^-1 k^m) j (mod N)                       (4-5)

for j = 0,1,2, . . ., N - 1, where p^-1 is the multiplicative inverse of
p modulo N. In order for the vector to become 1-ordered after some number of
routings, m from equation (4-5) must satisfy

    (p^-1 k^m) j ≡ j (mod N)                             (4-6)

A value for m satisfies this equation for all values of j if and only if
m satisfies

    p^-1 k^m ≡ 1 (mod N)                                 (4-7)

or

    k^m ≡ p (mod N)                                      (4-8)

From this, it is clear that the p-ordered vector can be unscrambled if and
only if p is an element of the cyclic group generated by k. Furthermore,
k must necessarily be a primitive root of N so that all values of p,
for p = 1,2, . . ., N - 1, can be generated by equation (4-8). For an
arbitrary value of p between 1 and N - 1, the maximum value of m in
equation (4-8) is N - 2, which is the maximum number of routings required to
unscramble a p-ordered vector. As an example (TN case), let N = 521 and
k = 3; then 3^519 ≡ 174 (mod 521).
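
The number of routings for each p, and the worst case of N - 2, can be tabulated for the N = 11 network. The sketch below uses a hypothetical helper, routings_needed, which searches for the smallest exponent m; it is an illustration only and does not describe how the hardware obtains m.

```python
def routings_needed(p, k, N):
    """Smallest m >= 0 with k^m ≡ p (mod N); None if p is not a power of k."""
    power = 1
    for m in range(N - 1):
        if power == p % N:
            return m
        power = (power * k) % N
    return None

N, k = 11, 2
m_for_p = {p: routings_needed(p, k, N) for p in range(1, N)}
print(m_for_p)
assert max(m_for_p.values()) == N - 2          # worst case is N - 2 routings
assert routings_needed(5, 2, 11) == 4          # the 5-ordered example of figure 4
assert pow(3, 519, 521) == 174                 # the TN case quoted in the text
# With k = 3, which is not a primitive root of 11, some p cannot be reached.
assert routings_needed(2, 3, 11) is None
```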

In summary, it can be seen from the theorem that if N is a prime and
k is a primitive root of N, then M_N generated by k is a cyclic group.
The order of M_N is N - 1 and the elements are 1,2, . . ., N - 1. All
p-ordered vectors, for p = 1,2, . . ., N - 1, can be unscrambled to a
1-ordered vector by a single k-apart interconnection network. This
unscrambling process is effected by routing the elements of the vector through the
network m times (see, e.g., fig. 4) according to

    k^m ≡ p (mod N)                                      (4-9)

The maximum number of routings is m = N - 2. After the (m - 1)th routing,
the vector always reduces to k-ordered so that the vector can be reduced to
a 1-ordered vector after the mth routing. Prior to the mth routing, the
reduction of p after each routing progresses in the powers of k (see,
e.g., eq. (4-3)).

V. TRANSPOSITION NETWORK DESIGN: k-APART NETWORK IMPLEMENTATION 


As described in section IV, the Swanson k-apart network can be used to
unscramble an n-element p-ordered vector stored in N registers, n ≤ N, with
a maximum of N - 2 routings. If N is large, like 521 in the FMP, then the
unscrambling can incur a long delay. To overcome this problem, several k-apart
networks can be connected in series to perform the N - 2 routings in the
progression of the powers of k as suggested in equation (4-3). In electrical
engineering this technique is called logarithmic compression, which is
similar to the technique of computing N^k in 2 log_2 (k + 1) steps as shown in
appendix A. In the sequel, it will be shown that the N - 2 routings can be
reduced to L = log_2 N routings by using L k^v-apart networks, where v = 2^i
for i = 0,1,2, . . ., L - 1.


Let the integer N be represented by an L-bit binary number. At most,
when N is at its maximum value, N is

    N = 2^(L-1) + . . . + 2^1 + 2^0 = 2^L - 1            (5-1)

or

    N - 2 = 2^L - 3                                      (5-2)

This gives

    L = log_2 (N + 1)

      = ⌈log_2 N⌉

      ≈ ln N / ln 2     for N >> 1                       (5-3)

For N = 521, L = 10, and for N = 11, L = 4. In equation (5-3), it is
suggested that the N - 2 routings can be reduced to L = log_2 N routings by
using L k-apart networks connected in series in L levels. At level i,
the network will perform either a straight-through connection (no routing) or
a routing by the amount equal to k^v (mod N), where v = 2^i for
i = 0,1,2, . . ., L - 1. The selection of either "to route" or "straight
through" is controlled by the binary bit 2^i of m, where m is calculated
from

    k^m ≡ p (mod N)                                      (5-4)


Let U(x) denote the x-apart network at level i, where x = k^v (mod N) and
v = 2^i, for i = 0,1, . . ., L - 1. Then the L levels of networks are as follows:

    U(x),  x = k^(2^0) (mod N)          1st level
    U(x),  x = k^(2^1) (mod N)          2nd level
      .
      .
    U(x),  x = k^(2^(L-1)) (mod N)      Lth level

As an example, for N = 11, k = 2, the four levels of networks are:

    U(2) = 2-apart          1st level
    U(4) = 4-apart          2nd level
    U(5) = 5-apart          3rd level
    U(3) = 3-apart          4th level


Note that the numbers 2, 3, 4, and 5 can be combined, as products modulo 11,
to form any number from 2 to 10; therefore, the four levels can be used
to unscramble any p-ordered vector for p = 1,2, . . ., 10. For the trivial
case when the vector to be unscrambled is a 1-ordered vector, all four
levels will be selected as "straight through," because m is zero in this
case, as calculated by equation (5-4). This concept of using L levels of
k-apart networks to implement the TN is shown in figure 5. The implementation
of a TN for N = 11 and k = 2 is shown in figure 6 with m set to 9 for
unscrambling a 6-ordered vector, since 2^9 ≡ 6 (mod 11). For the case when
N = 521 and k = 3, the 10 levels of networks are U(3), U(9), U(81), U(309),
U(138), U(288), U(105), U(84), U(283), and U(376).
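
The L-level arrangement can be sketched as a chain of optional routings, one per bit of m. In the sketch below, L is taken as the bit length of N (an assumption that agrees with L = ⌈log_2 N⌉ whenever N is not a power of 2), registers hold element subscripts, and all function names are hypothetical.

```python
def level_amounts(k, N):
    """Routing amounts k^(2^i) mod N for levels i = 0 .. L-1, L = bit length of N."""
    L = N.bit_length()
    return [pow(k, 2 ** i, N) for i in range(L)]

def route(regs, x):
    """One x-apart routing: Reg(x*j mod N) -> Reg(j)."""
    N = len(regs)
    return [regs[(x * j) % N] for j in range(N)]

def unscramble(regs, m, k):
    """Pass the registers through the L levels; bit i of m selects 'route' at level i."""
    N = len(regs)
    for i, x in enumerate(level_amounts(k, N)):
        if (m >> i) & 1:
            regs = route(regs, x)
    return regs

N, k, p, m = 11, 2, 6, 9                      # 2^9 ≡ 6 (mod 11), m = 1001 in binary
regs = [None] * N
for i in range(8):
    regs[(p * i) % N] = i                     # 8-element 6-ordered vector
print(level_amounts(k, N))                    # [2, 4, 5, 3]
print(unscramble(regs, m, k))                 # [0, 1, 2, 3, 4, 5, 6, 7, None, None, None]
```

With m = 1001 only the first and fourth levels route, so the net effect is a single (2 x 3 mod 11) = 6-apart routing.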


VI. TRANSPOSITION NETWORK DESIGN: BARREL SWITCH IMPLEMENTATION 


The implementation of the Swanson network using L = log_2 N k-apart networks
is a significant step in increasing the unscrambling speed by reducing
the number of routings m from N - 2 to log_2 N. The cost for this gain in
speed is the increase in hardware complexity, where the number of k-apart
networks is increased from 1 to log_2 N. There is, however, an alternative to
this implementation. By a close inspection of the k-apart network, it can be
observed that there is a uniform shift pattern in its input-to-output
connections (see, e.g., fig. 6). For this reason, it is reasonable to expect that
a uniform-shift network (ref. 18), commonly called a barrel switch, can be
used for such an implementation. In the sequel, it will be shown that a
single level of barrel switch, plus a fixed wiring pattern connected
according to k^m ≡ p (mod N) and its inverse, can be used to replace all
L = log_2 N levels of k-apart networks. The control for the barrel switch is
simple: m simply controls the switch shift amount.


Figure 5.- Implementation of Swanson network using L levels of k-apart
           networks. (Input: p-ordered vector; output: 1-ordered vector.)

Figure 6.- TN for N = 11 and k = 2, showing the unscrambling of a
           6-ordered vector for the routing amount M = M_3 M_2 M_1 M_0 = 1001.
           A "0" means straight through and a "1" means routing.

A barrel switch is a combinational logic network that can perform a uniform
shift of an incoming vector by a fixed number of positions. The shift
direction can be either left or right with options of either end-off or
end-around. The design description of barrel switches is almost all contained in
patent literature.

In "open" literature, design descriptions are few and far between. There
are a few pages in the book by Burroughs Corporation in 1968 (ref. 19), the
Signetics 8243 8-position scalar in 1970 (ref. 20), a few pages of description
in 1970 by Davis (ref. 21), and a design note in 1972 by Lim (ref. 22). In
this section, barrel switch design is not of concern, and it will be treated
simply as a uniform shift network, in particular, a left end-around
uniform-shift network.

In the description of p-ordered vectors in section II, the vector X is
p-ordered if the positions of its elements are described by

    X_i = j ≡ pi (mod N)                                 (5-1)

After X is unscrambled by the network, the result is a 1-ordered vector, or
p = 1 in equation (5-1). That is,

    X_i ≡ i (mod N)                                      (5-2)

or

    X_i = i                                              (5-3)

for i = 0,1,2, . . ., n - 1, and n ≤ N. The objective of this section is to
show that a Swanson network implemented by one row of barrel switch, plus a
fixed unscrambling wiring pattern and its inverse, can be used to reduce
equation (5-1) to equation (5-3).

From the Swanson k-apart network discussion in section IV,

    k^m ≡ p (mod N)                                      (5-4)

where k is the primitive root of N and m is the number of routings
required to unscramble p. Combining equations (5-1) and (5-4), the result is

    i k^m ≡ X_i (mod N)                                  (5-5)

In the description of number theory in section III, equation (5-5) is a
binomial congruence and it can be solved by the theory of indices. Taking
indices on both sides of equation (5-5), the result is

    ind_k i + m ind_k k ≡ ind_k X_i (mod φ(N))           (5-6)

or

    ind_k i ≡ ind_k X_i - m ind_k k (mod φ(N))           (5-7)

Since

    k^m ≡ p (mod N)

then

    m ind_k k ≡ ind_k p (mod φ(N))                       (5-8)

Combining equations (5-7) and (5-8), the result is

    ind_k i ≡ ind_k X_i - ind_k p (mod φ(N))             (5-9)

or

    ind_k i ≡ ind_k pi - ind_k p (mod φ(N))              (5-10)


Equation (5-10) is the key equation in demonstrating that the barrel
switch indeed can implement the Swanson network. The implementation of this
key equation is shown in figure 7. The explanation of figure 7 is as follows:

1. The input to the unscrambling wiring pattern W, point-A, is a
p-ordered vector X_i ≡ pi (mod N).

2. In order to obtain X_i = ind_k pi (mod φ(N)) at point-B, W is wired
according to the index of p of the unscrambling equation

    k^m ≡ p (mod N)

In this equation, m = ind_k p; thus, the wiring is from p to m for
p = 0,1,2, . . ., N - 1.

3. At points B and C, the input of X_i = ind_k pi (mod φ(N)) is shifted
m times left end-around by the barrel switch to obtain the key equation

    X_i = ind_k i (mod φ(N))

        ≡ ind_k pi - ind_k p (mod φ(N))

Note that m = ind_k p and left shift is minus. It can be shown that

    m ≡ ind_k p (mod φ(N))

as follows. From

    k^m ≡ p (mod N)

it follows that

    m ind_k k ≡ ind_k p (mod φ(N))

or

    m ≡ ind_k p (mod φ(N))

since ind_k k = 1 by property (3) of index theorem 1.

4. The output at point-D is obtained by routing

    X_i = ind_k i

through W^-1, which is the inverse of W. This inverse wiring pattern will
obtain i from ind_k i, so that the desired output of

    X_i ≡ i (mod N)

is obtained.


[Figure 7: X (p-ordered), X_i ≡ pi (mod N), enters the unscrambling wiring
pattern W (fixed connection p → m of k^m ≡ p (mod N)), giving
X_i = ind_k pi [mod φ(N)]; the barrel switch BS (left end-around shift,
shifted m times) gives X_i = ind_k pi - ind_k p = ind_k i [mod φ(N)];
W^-1 (inverse of W) gives X (1-ordered), X_i = i (mod N).]

Figure 7.- Barrel switch implementation of Swanson network according to the
           key equation, equation (5-10).

To illustrate the above steps further, an example for N = 11 and k = 2
is shown in figure 8. The wiring pattern W is constructed according to

    2^m ≡ p (mod 11)

for p = 1,2, . . ., 10. The wiring direction is from p to m because m
is the index of p. Also shown in figure 8 is the unscrambling of a
p-ordered vector with p = 6, the same as in figure 6.

At this point, it should be pointed out that the description of unscrambling
p-ordered vectors has assumed that the offset is zero. That is, X_0 of X is
always in memory module 0 or in register 0. If the offset is not zero, then
a preshift of X is required. This can be accomplished by a barrel switch
prior to the input of the TN.


Figure 8.- Illustration of figure 7 for N = 11, k = 2, and p = 6. If the
           offset of X is not zero, X can be preshifted by another barrel
           switch prior to input of W.
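
The four steps of the barrel switch implementation can be simulated for the N = 11, k = 2, p = 6 case. The following is a minimal sketch under stated assumptions: the barrel switch is modeled as φ(N) = N - 1 lines numbered by index modulo φ(N), register 0 (which has no index) is assumed to be passed straight through, and the function names are hypothetical.

```python
def index_table(k, N):
    """{j: ind_k j} for 1 <= j < N, indices reduced modulo phi(N) = N - 1."""
    return {pow(k, t, N): t for t in range(N - 1)}

def barrel_unscramble(regs, p, k):
    N = len(regs)
    phi = N - 1
    ind = index_table(k, N)
    m = ind[p]                                    # shift amount, m = ind_k p

    # W: register j (j != 0) is wired to barrel-switch line ind_k j.
    lines = [None] * phi
    for j in range(1, N):
        lines[ind[j]] = regs[j]

    # Barrel switch: left end-around shift by m positions.
    shifted = [lines[(t + m) % phi] for t in range(phi)]

    # W^-1: barrel-switch line t is wired to register k^t mod N.
    out = [None] * N
    out[0] = regs[0]                              # assumed straight-through path
    for t in range(phi):
        out[pow(k, t, N)] = shifted[t]
    return out

N, k, p = 11, 2, 6
regs = [None] * N
for i in range(8):
    regs[(p * i) % N] = i                         # 8-element 6-ordered vector
print(barrel_unscramble(regs, p, k))              # [0, 1, 2, 3, 4, 5, 6, 7, None, None, None]
```

The shift amount found here, m = ind_2 6 = 9, is the same value of m used for the L-level network of figure 6.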


VII. PROGRAMMING AND APPLICATION 


The programming of the TN, for either the L-level k-apart networks or the
barrel switch implementation, is identical and very simple. It involves only
two parameters: the offset and the shift amount m. These two parameters are
available from the compiler of the FMP since the compiler is responsible for
laying out the three-dimensional data set in the EM modules. The offset of a
p-ordered vector X, as described in section II, is a variable and is simply
equal to the memory module address of the first element of X. The parameter
p is also a variable and is simply equal to the difference of the
memory module addresses of two successive elements of X, X_i and X_(i+1). Note
that the subscript i of X_i here runs from 1 to n because in a FORTRAN
matrix, the counting starts with 1, not 0. Once p is known, m can be
obtained as the index of p according to k^m ≡ p (mod N).

The primary application of the TN is for unscrambling p-ordered vectors 
that arise naturally from data allocation in memory modules. The secondary 
application is for data communication between processors through the EM mod- 
ules, since there are no direct connections between processors of the FMP as 
shown in figure 1. 

In this section, the derivation of the memory module addresses and a 
brief description of the TN application to unscramble three-dimensional data 
sets of turbulent flows are presented. The presentation here is for illustra- 
tive purposes only. In practice, the actual techniques used may be different. 
Also, the application of the TN to perform a perfect shuffle is described 
briefly without proof. 


PROGRAMMING 


Before proceeding with the memory module address calculation, the defini- 
tions of a few notations are important. Let a three-dimensional data set be 
represented by D(I,J,K), where I,J, and K are the maximum values in the 
three dimensions. It follows that D(I) and D(I,J) are one-dimensional and 
two-dimensional data sets, respectively. If the data layout method is row 
major order, then I, J, and K represent the row, column, and file, respec- 
tively. If, however, the data layout method is column major order, then I, J, 
and K represent the column, row, and file, respectively. As an example, the 
data set D(I,J,K) in figure 9 is D(3,5,7) for row major order and is 
D(5,3,7) for column major order. In this paper, only the column major order 
data layout method is of interest because of FORTRAN compatibility. For this
method, the data layout (similar to figure 3) for N = 11 memory modules is
shown in figure 10.


[Figure 9: a block of elements labeled ijk, from 111 to 537; one axis is
I (column major) or J (row major), the other is J (column major) or
I (row major).]

Figure 9.- A three-dimensional data set: for row major,
           D(I,J,K) = D(3,5,7); for column major, D(I,J,K) = D(5,3,7).

Figure 10.- Column major order storage of a three-dimensional
            data set D(I,J,K) = D(5,3,7).


Let A_M(i,j,k) denote the memory module address of an element in
D(I,J,K). It follows that A_M(i) and A_M(i,j) are memory module addresses for
D(I) and D(I,J), respectively. In figure 10, the element 111 is the very
first element of D(I,J,K), and it is stored in memory module M_8. For this
reason, the starting address offset A_0, in this case, is 8. The address
offset A_0 is sometimes also called the base address. If the element 111
is stored in M_0, then A_0 = 0. It should be noted that the offset A_0
should not be confused with the offset of a p-ordered vector, which is equal
to A_M(i,j,k) of the first element in a vector. In the sequel, the address
A_M(i,j,k) will be derived systematically, starting with the one-dimensional
data set D(I), and figures 9 and 10 will be used as references.

Let a one-dimensional array, or vector, X have elements X_i for
i = 1,2, . . ., n. The first element of X, X_1, is stored in a memory module
located A_0 memory modules away from M_0. Thereafter, the elements of X
are stored in memory modules with succeeding elements at progressively
higher numbered modules. In this case obviously, A_M(i), the memory module
address of X_i, is

    A_M(i) = A_0 + (i - 1) (mod N)                       (7-1)


In a two-dimensional data set D(I,J), the memory module address A_M(i,j)
of the element X_ij (assuming column major) is

    A_M(i,j) = A_M(i) + I(j - 1) (mod N)

             = A_0 + (i - 1) + I(j - 1) (mod N)          (7-2)

The proof of equation (7-2) is by the observation that the term A_M(i)
is simply the address of X_ij within a column in D(I,J). For the first column,
the term I(j - 1) = 0 because j = 1. Therefore, A_M(i) can also be
interpreted as the address of X_ij in the first column. After the X_ij in the
first column are exhausted, the address of X_ij in the second column is incremented
by I, which is the column dimension. In the third column and thereafter, the
address of X_ij is incremented by 2I, 3I, . . ., I(j - 1), in that order.
This concludes the proof.

In a three-dimensional data set D(I,J,K), the memory module address
A_M(i,j,k) of the element X_ijk (assuming column major) is

    A_M(i,j,k) = A_M(i,j) + IJ(k - 1) (mod N)

               = A_0 + (i - 1) + I(j - 1) + IJ(k - 1) (mod N)   (7-3)

The proof for equation (7-3) is similar to that for equation (7-2). The
term IJ(k - 1) is the increment for the planes. For the first plane,
IJ(k - 1) = 0 because k = 1. Thereafter, the plane is incremented by
IJ, 2IJ, . . ., IJ(k - 1). Referring to figure 10, the following examples
will illustrate the application of equation (7-3).


Example 7-1. The row vector consisting of row 2 in plane 1 and row 2 in
             plane 2 is (211, 221, 231, 212, 222, 232). In figure 10,
             the layout of this vector is

              0    1    2    3    4    5    6    7    8    9   10

              *   232  212  221   *    *    *   222  231  211   *

             To fetch this vector, the offset and p, and hence m,
             are required. These two parameters can be obtained by
             calculating A_M(2,1,1) and A_M(2,2,1) as follows:

             A_M(2,1,1) = 8 + (2 - 1) + 5(1 - 1) + 5 x 3(1 - 1)

                        = 9 (mod 11)

             A_M(2,2,1) = 8 + (2 - 1) + 5(2 - 1) + 5 x 3(1 - 1)

                        ≡ 3 (mod 11)

             The offset and p are then

             offset = A_M(2,1,1) = 9

             p = A_M(2,2,1) - A_M(2,1,1) = 3 - 9 ≡ 5 (mod 11)

             The difference is reduced modulo N because the memory
             module numbering wraps around from module N - 1 to module 0.


Example 7-2. The file vector consisting of (531, 532, 533, 534, 535, 536,
             537) in figure 10 is

              0    1    2    3    4    5    6    7    8    9   10

             531  534  537   *   532  535   *    *   533  536   *

             The offset and p are:

             offset = A_M(5,3,1)

                    = 8 + (5 - 1) + 5(3 - 1) + 5 x 3(1 - 1)

                    ≡ 0 (mod 11)

             p = A_M(5,3,2) - A_M(5,3,1) = 4 - 0 = 4
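
The address calculation of equations (7-1) to (7-3) can be written once for any number of dimensions. The sketch below is an illustration under stated assumptions: p is taken as the difference of the module addresses of two successive elements reduced modulo N, which reproduces p = 5 and p = 4 for examples 7-1 and 7-2, and the function names are hypothetical.

```python
def module_address(idx, dims, base, N):
    """Column-major memory module address, equations (7-1) to (7-3), 1-based indices."""
    addr, stride = base, 1
    for i, size in zip(idx, dims):
        addr += (i - 1) * stride      # (i-1), I(j-1), IJ(k-1), ...
        stride *= size
    return addr % N

def offset_and_p(first, second, dims, base, N):
    """Offset = module of the first element; p = module step to the next element (mod N)."""
    a0 = module_address(first, dims, base, N)
    a1 = module_address(second, dims, base, N)
    return a0, (a1 - a0) % N

dims, base, N = (5, 3, 7), 8, 11          # D(5,3,7) of figure 10, A_0 = 8
print(offset_and_p((2, 1, 1), (2, 2, 1), dims, base, N))   # (9, 5)
print(offset_and_p((5, 3, 1), (5, 3, 2), dims, base, N))   # (0, 4)
```

Because the loop accumulates the stride one dimension at a time, the same function also covers data sets of more than three dimensions.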

The development of the three-dimensional address equation A_M(i,j,k) can
be extended to n dimensions. For notational convenience, let the
n-dimensional address equation be A_M(i_1,i_2, . . ., i_n). The progression of
this development, starting with n = 4, is (all equations are modulo N):

    A_M(i_1,i_2,i_3,i_4) = A_M(i_1,i_2,i_3) + I_1 I_2 I_3 (i_4 - 1)

    A_M(i_1,i_2,i_3,i_4,i_5) = A_M(i_1,i_2,i_3,i_4) + I_1 I_2 I_3 I_4 (i_5 - 1)

      .
      .

    A_M(i_1,i_2, . . ., i_n) = A_M(i_1,i_2, . . ., i_(n-1)) + I_1 I_2 . . . I_(n-1) (i_n - 1)


Once the offset and the p parameters are obtained, the CU obtains
m from k^m ≡ p (mod N) in some manner, probably by table lookup. The CU
effects this control over the TN and connects the 512 processors to the 521 EM
modules in some p-order. From this discussion, it should be clear that a
processor has absolutely no control over its EM module connection. For this
reason, any EM access by a processor must be coordinated with the CU and also
wait for the other 511 processors to come to a stop or a synchronization point
before such an access can be executed. These wait and synchronization events
can cause a degradation in computation performance. There are, however,
exceptions to these restrictions. The discussion on these exceptions, of
course, is beyond the scope of this paper. After the processor-memory
connection is made by the TN, each processor is responsible for generating the
address A(i,j,k) to access a location within its connected EM module. This
address can be obtained from A_M(i,j,k), as follows:

    A(i,j,k) = ⌊ |A_M(i,j,k)| / N ⌋                      (7-4)

where |A_M(i,j,k)| denotes the true value of A_M(i,j,k) without the
congruence, that is, before reduction modulo N. The following examples will
illustrate the application of equation (7-4).

Example 7-3. In example 7-1, the element 211 is in M_9. Its location
             in M_9 is

             A(2,1,1) = ⌊9/11⌋ = 0

             The element 221 is in M_3. Its location in M_3 is

             A(2,2,1) = ⌊14/11⌋ = 1

Example 7-4. In example 7-2, the element 531 is in M_0. Its location
             in M_0 is

             A(5,3,1) = ⌊22/11⌋ = 2

             The element 537 is in M_2. Its location in M_2 is

             A(5,3,7) = ⌊112/11⌋ = 10
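
Equation (7-4) and the module address are the quotient and remainder of the same division. The following sketch, with hypothetical function names, splits the unreduced column-major address into the module number and the location within that module for the four elements used in examples 7-3 and 7-4.

```python
def true_address(idx, dims, base):
    """|A_M|: the column-major address before reduction modulo N (1-based indices)."""
    addr, stride = base, 1
    for i, size in zip(idx, dims):
        addr += (i - 1) * stride
        stride *= size
    return addr

def module_and_location(idx, dims, base, N):
    """Split the true address into (module number, location within that module)."""
    location, module = divmod(true_address(idx, dims, base), N)
    return module, location

dims, base, N = (5, 3, 7), 8, 11
for element in [(2, 1, 1), (2, 2, 1), (5, 3, 1), (5, 3, 7)]:
    print(element, module_and_location(element, dims, base, N))
```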



This completes the discussion on TN programming. 


APPLICATION 


The primary application of the TN, as mentioned earlier, is for unscram- 
bling p-ordered and pq-ordered vectors. It was shown that these vectors arise 
naturally from the storage allocation of three-dimensional data sets. In 
practice, these data set sizes for the present Reynolds averaged Navier-Stokes 
flow codes are typically (100 x 100 x 100) and (200 x 50 x 100) . The compu- 
tational methods to solve these equations are often specially split, which 
requires that memory accesses must be made in all three directions. In each 
of the I, J, and K directions, two memory access patterns are required. 

These two patterns are the row access and the column access. This gives a 
total of six possible access patterns. Consistent with earlier definitions, 
let D(I,J,K) denote the access pattern when computing in the K direction 
with column access. Then, with respect to the six possible permutations of 
the I, J, and K directional indices, these six access patterns are: 


    D(I,J,K)     K direction, column access
    D(J,I,K)     K direction, row access
    D(I,K,J)     J direction, column access
    D(K,I,J)     J direction, row access
    D(J,K,I)     I direction, column access
    D(K,J,I)     I direction, row access


The above six access patterns are also presented in reference 2, appen- 
dix A, in a somewhat cryptic manner. In each of these six access patterns, 
the TN must be programmed to unscramble p-ordered and pq-ordered vectors and 
also, vectors that have memory access conflicts. 


The TN can also be programmed to perform a perfect shuffle (ref. 23).
The following description is given without proof. Let X be a 1-ordered
vector of n = 2^l elements, where l is a positive integer. It is assumed
that the elements of X are stored in N memory modules with succeeding
elements at progressively higher numbered modules. To perform a perfect
shuffle on X, the TN can be programmed as follows:

1. Treat X as a p-ordered vector. Unscramble X with p = n/2. The
resultant vector is a pq-ordered vector.

2. Shift the pq-ordered vector as many times as necessary to form the
perfect shuffle.

An example for n = 2^3 and N = 11 is shown in figure 11. In this
figure, the TN is set to m = 2, corresponding to p = 4. The unscrambled
vector is a pq-ordered vector with p = 1 and q = 1. If this pq-ordered
vector is rearranged three more times by appropriate left shifting, the result
obtained is (X_0 X_4 X_1 X_5 X_2 X_6 X_3 X_7), which is a perfect shuffle of
eight elements.
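
The two steps can be traced for n = 8 and N = 11. In the sketch below the TN is modeled as a single p-apart routing, registers hold element subscripts, and the gap-closing shifts of step 2 are replaced by simply dropping the empty registers; that substitution and the function name are assumptions made for illustration.

```python
def unscramble(regs, p):
    """TN set for p: the content of register p*j mod N is delivered to register j."""
    N = len(regs)
    return [regs[(p * j) % N] for j in range(N)]

n, N = 8, 11
regs = list(range(n)) + [None] * (N - n)      # 1-ordered X_0 .. X_7, three empty registers
routed = unscramble(regs, p=n // 2)
print(routed)      # [0, 4, None, 1, 5, None, 2, 6, None, 3, 7]

# Step 2 closes the gaps; dropping the empty registers stands in for the shifts.
shuffled = [x for x in routed if x is not None]
print(shuffled)    # [0, 4, 1, 5, 2, 6, 3, 7]
```

The routed vector contains three gaps, one after each pair of elements, which corresponds to the three further rearrangements mentioned in the text.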



[Figure 11: TN set for m = 2, corresponding to p = 4.]

Figure 11.- TN used to perform a perfect shuffle of 8 elements
            by first converting X to a pq-ordered vector.


CONCLUSION


A tutorial description of the Burroughs NASF TN has been presented.
Basically, the TN is a bidirectional programmable combinational logic
network that connects a 521-module EM to an array of 512 processors, where 521
is selected as the smallest prime number greater than 512. The TN is
one method of solving the traditional memory-processor connection problem in
parallel array processors. The primary application of the TN is for
unscrambling p-ordered and pq-ordered vectors. The TN, if programmed appropriately,
can also be used to perform a perfect shuffle. The advantage of the TN is
simplicity and ease of control. The disadvantage of the TN, in the general
sense, is low performance.

It was shown that p-ordered and pq-ordered vectors arise naturally from 
storage allocation of two-, three-, and n-dimensional data sets in the EM of 
the FMP, which is similar in architecture to that of the ILLIAC IV array 
memory system. If the vector is p-ordered, the TN can unscramble it to a 
1-ordered vector in one cycle. If the vector is pq-ordered, or some other 
permutation-ordered, several cycles are required for unscrambling. Unlike 
other more complicated and powerful permutation networks, the TN cannot, in 
general, unscramble non-p-ordered vectors in one cycle. This can be a
disadvantage.

The programming of the TN is relatively simple when compared to the other 
permutation networks. This is an advantage. Two programming parameters, the 
offset and the shift amount, are controlled by the CU. Since the compiler is 
responsible for laying out the data set in EM, the compiler can furnish the 
information to the CU for calculating these two parameters. In this context,
the CU has absolute control over the TN. For this reason, any EM access by a
processor must be coordinated with the CU and must also wait for the other
511 processors to come to a stop or a synchronization point before such an
access can be executed. These wait and synchronization events can degrade
the performance of a computation. This can be a disadvantage.

The design and the implementation of the TN are simple. This is an
advantage. The design is based upon the Swanson network (ref. 8). The Swanson
network is a k-apart interconnection network constructed according to the 
theory of cyclic groups and the theory of primitive roots. The barrel switch, 
it turns out, can be used to implement the Swanson network, and hence the TN. 
The implementation consists of a single level of barrel switch plus a fixed 
wiring pattern and its inverse. The result is a very simple network. 
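One way to read the "fixed wiring pattern, barrel switch, inverse wiring pattern" structure is through discrete logarithms: with a primitive root r of N = 521, multiplying a module number by p corresponds to adding a constant to its exponent, which is a cyclic shift. The sketch below checks only this number-theoretic identity. The mapping of the three steps onto the hardware is an interpretation, not a description of the Burroughs design, and all names are hypothetical.

```python
N = 521  # prime number of memory modules


def prime_factors(n):
    """Return the set of prime factors of n by trial division."""
    factors = set()
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.add(d)
            n //= d
        d += 1
    if n > 1:
        factors.add(n)
    return factors


def smallest_primitive_root(prime):
    """Return the smallest r whose powers generate all of 1..prime-1."""
    order = prime - 1
    factors = prime_factors(order)
    for r in range(2, prime):
        # r is a primitive root iff r**(order/q) != 1 for every prime q | order.
        if all(pow(r, order // q, prime) != 1 for q in factors):
            return r
    raise ValueError("no primitive root found; is the modulus prime?")


r = smallest_primitive_root(N)
exp_table = [pow(r, e, N) for e in range(N - 1)]      # exponent -> module number
log_table = {m: e for e, m in enumerate(exp_table)}   # module number -> exponent


def multiply_via_shift(j, p):
    """Compute (j*p) mod N for j, p in 1..N-1 as log, cyclic shift, exp."""
    return exp_table[(log_table[j] + log_table[p]) % (N - 1)]


# Module 0 is left out: 0*p = 0, so it needs no shifting.
assert all(multiply_via_shift(j, 4) == (j * 4) % N for j in range(1, N))
print("primitive root used:", r)
```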

In the FMP of the NASF, whether or not the TN should be used to solve the 
traditional memory-processor connection problem is a matter of complexity 
versus flexibility. The TN is simple but inflexible. Other permutation net- 
works, such as the Benes network (ref. 4), are somewhat more powerful and 
more complex. The research and development of these networks are technically 
challenging and also are important in advancing the art of parallel computing. 
It is hoped that this effort will be encouraged and continued. 
APPENDIX A


A Method to Compute $N^k$

In general, the computation of $N^k$, where both k and N are positive
integers, requires about k multiplications (k - 1 by straightforward repeated
multiplication). The method to be presented, an old and undocumented method in
number theory, requires at most

$$2 \log_2 (k + 1) = 2\,[\ln(k + 1)/\ln 2]$$

multiplications. To compute $N^k$, express k as an n-bit binary number as
follows:


$$k = k_0 2^0 + k_1 2^1 + \cdots + k_{n-1} 2^{n-1} \tag{A-1}$$

where $k_i = 0$ or 1, $i = 0, 1, 2, \ldots, n - 1$. Next,


$$N^k = N^{k_0 2^0 + k_1 2^1 + \cdots + k_{n-1} 2^{n-1}}
      = N^{k_0 2^0} \, N^{k_1 2^1} \cdots N^{k_{n-1} 2^{n-1}} \tag{A-2}$$


In equation (A-2), $N^k$ can be obtained in n, n << k, multiplications
plus the preparation of a powers-of-2 table of at most n entries. The value
of n can be related to k by applying the identity

$$2^n - 1 = 2^0 + 2^1 + \cdots + 2^{n-1} \tag{A-3}$$

to equation (A-1). In equation (A-1), there are at most n terms. The
upper bound for k is

$$k = 2^n - 1 \tag{A-4}$$


or

$$n = \log_2 (k + 1) \tag{A-5}$$

Thus, the number of multiplications required to compute $N^k$ is

$$2n = 2 \log_2 (k + 1) \tag{A-6}$$

namely, n multiplications for equation (A-2), and n multiplications for
preparing the powers-of-2 table. This method can be summarized as follows:

1. Prepare a table by computing $N^{2^i}$ for i = 0, 1, 2, . . . , up to about
n - 1.

2. Form the product per equation (A-2) for those terms for which $k_i = 1$.
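The two steps above can be sketched as follows. This is a minimal illustration rather than a tuned implementation; the function name `power_by_table` and the multiplication counter are hypothetical additions used to compare the count against the bound of equation (A-6).

```python
def power_by_table(base, k):
    """Return (base**k, multiplications used) via the two-step table method."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    n = k.bit_length()          # number of bits of k, i.e. ceil(log2(k + 1))
    mults = 0

    # Step 1: table[i] = base**(2**i), each entry the square of the previous one.
    table = [base]
    for _ in range(1, n):
        table.append(table[-1] * table[-1])
        mults += 1

    # Step 2: multiply together the table entries for which bit k_i is 1.
    result = None
    for i in range(n):
        if (k >> i) & 1:
            if result is None:
                result = table[i]       # first factor costs no multiplication
            else:
                result *= table[i]
                mults += 1
    return result, mults


value, mults = power_by_table(2, 20)
print(value, mults)   # 1048576 5
```

For k = 20 the sketch uses 4 squarings for the table and 1 multiplication for the product, which is within the bound 2n = 10 for n = 5.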
Example 1. Compute $2^{20}$.

1. Express the exponent in binary and prepare the table:

   bit position i    4    3    2    1    0
   20 =              1    0    1    0    0

   i                 0    1    2    3      4
   $2^{2^i}$         2    4   16  256  65536

2. $2^{20} = 2^{2^4} \cdot 2^{2^2} = 65536 \times 16 = 1,048,576$

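Example 2. Show that $2^{260} \equiv 1$ modulo 521.

1. Express the exponent in binary and prepare the table (mod 521):

   bit position i    8    7    6    5    4    3    2    1    0
   260 =             1    0    0    0    0    0    1    0    0

   i                 0    1    2    3    4    5    6    7    8
   $2^{2^i}$         2    4   16  256  411  117  143  130  228   (mod 521)

2. $2^{260} = 2^{2^8} \cdot 2^{2^2} = 228 \times 16 = 3648$

   $3648 \equiv 1$ modulo 521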
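The table and the final product in Example 2 can be checked with a few lines of code. The sketch below rebuilds the table of repeated squares modulo 521 and compares the result with Python's built-in three-argument `pow`; the variable names are illustrative only.

```python
MODULUS = 521

# squares[i] = 2**(2**i) mod 521, built by repeated squaring.
squares = [2 % MODULUS]
for _ in range(8):
    squares.append(squares[-1] ** 2 % MODULUS)

# 260 = 2**8 + 2**2, so only table entries 8 and 2 enter the product.
product = squares[8] * squares[2]

print(squares)                     # [2, 4, 16, 256, 411, 117, 143, 130, 228]
print(product, product % MODULUS)  # 3648 1
assert product % MODULUS == pow(2, 260, MODULUS)
```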